-
Notifications
You must be signed in to change notification settings - Fork 2.7k
WIP: Move DateTime native code into managed #21848
Conversation
Some of this is implemented as FCalls to avoid PInvoke overhead. This would need performance testing as well. |
I am aware of that. The history of the repository shows that has been done in other places. However, the BOTR explicitly states
Maybe it's time to have some updated guidance on that. |
@dotnet-bot test Windows_NT arm Cross Debug Innerloop Build |
I've done some preliminary benchmarks and it shows some rather interesting things. Firstly, my initial implementation with
Now, these numbers are not completely representative since it was run on .NET Core 2.1.6 runtime with a code copied to a separate BenchmarkDotNet project. Perhaps there's a better way to benchmark something like this? Another interesting observation is that the |
That seems to be expected: #9736 |
@Suchiman Thanks, I missed that remark in the PR. Looks like .NET Core 2.1.6 doesn't properly optimize the |
I had some flaws in my benchmark. This is current state with .NET Core 3.0 Preview 1 runtime:
Is this worth pursuing further? Maybe there are ways to squeeze the last few nanoseconds back. |
@filipnavara UtcNow is often on the hot path, yes I think it is worth investigating further perf improvements. |
Well, I can scale back the PR and keep the FCalls and move just the rest to shared partition. I am not sure if I will be able to squeeze any more performance out of the P/Invoke approach. |
The performance is closer than what I thought it is going to be. I will take a look at this in detail when I get a chance. |
Thanks! Any advice or help is appreciated. |
There are some particularly bad things in the code gen, which could likely be eliminated:
|
Hmm, can you post more code? I've seen such jumps before but they're not common. |
@mikedn I extracted the PInvoke alone into a small function public unsafe void JustPInvoke()
{
long timestamp;
Interop.Kernel32.GetSystemTimePreciseAsFileTime(×tamp);
} and looked at the generated code (.NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT): ; DateTimeTest.DateTimeBenchmark.JustPInvoke()
push rbp
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbx
sub rsp,78h
lea rbp,[rsp+0B0h]
xor edx,edx
mov qword ptr [rbp-40h],rdx
mov qword ptr [rbp-48h],rdx
mov qword ptr [rbp+10h],rcx
lea rcx,[rbp-80h]
mov rdx,r10
call coreclr+0xad090
mov qword ptr [rbp-48h],rax
mov rcx,rsp
mov qword ptr [rbp-60h],rcx
mov rcx,rbp
mov qword ptr [rbp-50h],rcx
lea rcx,[rbp-40h]
xor r11d,r11d
mov rax,7FF89F1F39D8h
mov qword ptr [rbp-70h],rax
lea rax,[00007ff8`9f0affde]
mov qword ptr [rbp-58h],rax
mov rax,qword ptr [rbp-48h]
lea rdx,[rbp-80h]
mov qword ptr [rax+10h],rdx
mov rax,qword ptr [rbp-48h] ; This is loading a value that was already loaded 4 lines above
mov byte ptr [rax+0Ch],0
call qword ptr [00007ff8`9f1f3b70]
mov rdx,qword ptr [rbp-48h]
mov byte ptr [rdx+0Ch],1
cmp dword ptr [coreclr!g_CLREngineMetrics+0x9080
je M00_L00
call qword ptr [coreclr!CreateCLRProfiling+0x20b188
M00_L00:
mov rax,qword ptr [rbp-48h]
mov rdx,qword ptr [rbp-78h]
mov qword ptr [rax+10h],rdx
jmp M00_L01 ; Unnecessary jump
M00_L01:
mov rax,qword ptr [rbp-48h] ; Unnecessary load (probably not eliminated because of the jump?)
mov byte ptr [rax+0Ch],1
lea rsp,[rbp-38h]
pop rbx
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
...
; Total bytes of code 204 It could very well be something screwed up by BenchmarkDotNet or my .NET Core installation. Smallest repro code here: using System;
using System.Runtime.InteropServices;
public class C {
[DllImport("kernel32", ExactSpelling = true, SetLastError = false)]
internal static unsafe extern void GetSystemTimePreciseAsFileTime(long* lpSystemTimeAsFileTime);
public unsafe void JustPInvoke()
{
long timestamp;
GetSystemTimePreciseAsFileTime(×tamp);
}
} |
You are probably looking at unoptimized tier-0 code; here's what I see for your repro. ; Assembly listing for method C:JustPInvoke():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd
; V01 loc0 [V01 ] ( 1, 1 ) long -> [rbp-0x40] do-not-enreg[X] must-init addr-exposed ld-addr-op
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V03 FramesRoot [V03,T00] ( 7, 7 ) long -> rsi "Pinvoke FrameListRoot"
; V04 PInvokeFrame [V04 ] ( 8, 8 ) blk (64) [rbp-0x80] do-not-enreg[X] addr-exposed "Pinvoke FrameVar"
;
; Lcl frame size = 104
G_M55676_IG01:
push rbp
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbx
sub rsp, 104
lea rbp, [rsp+A0H]
xor rdx, rdx
mov qword ptr [rbp-40H], rdx
G_M55676_IG02:
lea rcx, bword ptr [rbp-78H]
mov rdx, r10
call CORINFO_HELP_INIT_PINVOKE_FRAME
mov rsi, rax
mov rcx, rsp
mov qword ptr [rbp-58H], rcx
mov rcx, rbp
mov qword ptr [rbp-48H], rcx
lea rcx, bword ptr [rbp-40H]
xor r11, r11
mov rax, 0xD1FFAB1E
mov qword ptr [rbp-68H], rax
lea rax, G_M55676_IG04
mov qword ptr [rbp-50H], rax
lea rax, bword ptr [rbp-78H]
mov qword ptr [rsi+16], rax
mov byte ptr [rsi+12], 0
G_M55676_IG03:
call [C:GetSystemTimePreciseAsFileTime(long)]
G_M55676_IG04:
mov byte ptr [rsi+12], 1
cmp dword ptr [(reloc)], 0
je SHORT G_M55676_IG05
call [CORINFO_HELP_STOP_FOR_GC]
G_M55676_IG05:
mov rax, bword ptr [rbp-70H]
mov qword ptr [rsi+16], rax
G_M55676_IG06:
mov byte ptr [rsi+12], 1
G_M55676_IG07:
lea rsp, [rbp-38H]
pop rbx
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
; Total bytes of code 157, prolog size 30 for method C:JustPInvoke():this BDN should be getting you to tier-1 code, but you can force the issue via |
@AndyAyersMS Yeah, there's BDN bug filled for that (dotnet/BenchmarkDotNet#899). I was seeing some weird output with |
@filipnavara you can get that Output via https://github.com/dotnet/coreclr/blob/master/Documentation/building/viewing-jit-dumps.md |
Yeah, we decided in #9736 that the performance difference was worth it for the accuracy. |
Closing for now, thanks for all the valuable feedback. The Unix part was superseded by #22383 and the P/Invoke performance improvements are discussed there and in #22320. |
Note to myself: This should be resurrected now that |
Just testing build and unit tests now. If it works out then the code could be shared between CoreCLR and CoreRT completely.